Abstract

We study the stochastic linear optimization problem with bandit feedback. The arms take values in an $N$-dimensional space and belong to a bounded polyhedron described by finitely many linear inequalities. We provide a lower bound for the expected regret that scales as $\Omega(N \log T)$. We then provide a nearly optimal algorithm that alternates between exploration and exploitation intervals and show that its expected regret scales as $O(N \log^{1+\epsilon} T)$ for an arbitrarily small $\epsilon > 0$. We also present an algorithm that achieves the optimal regret when the sub-Gaussian parameter of the noise is known. Our key insight is that for a polyhedron the optimal arm is robust to small perturbations in the reward function. Consequently, a greedily selected arm is guaranteed to be optimal when the estimation error falls below a suitable threshold. Our solution resolves a question posed by [1] that left open the possibility of efficient algorithms with asymptotic logarithmic regret bounds. We also show that the regret upper bounds hold with probability 1. Our numerical investigations show that, while the theoretical results are asymptotic, the performance of our algorithms compares favorably to state-of-the-art algorithms in finite time as well.


1 Introduction 

Stochastic bandits are sequential decision-making problems where a learner plays an action in each round and observes the corresponding reward. The goal of the learner is to collect as much reward as possible or, alternatively, to minimize regret over a period of $T$ rounds. Stochastic linear bandits are a class of structured bandit problems where the rewards from different actions are correlated. In particular, the expected reward of each action or arm is expressed as an inner product of a feature vector associated with the action and an unknown parameter which is identical for all the arms. With this structure, one can infer the reward of arms that are not yet played from the observed rewards of other arms. This allows for considering cases where the number of arms is unbounded and playing each arm is infeasible.

Stochastic linear bandits have found rich applications in many fields including web advertisements [?], recommendation systems [?], packet routing, revenue management, etc. In many applications the set of actions is defined by a finite set of constraints. For example, in packet routing, the amount of traffic to be routed on a link is constrained by its capacity. In web-advertisement problems, the budget constraints determine the set of available advertisements. It follows that each arm in these applications belongs to a polyhedron.

Bandit algorithms are evaluated by comparing their cumulative reward against the optimal achievable cumulative reward; the difference is referred to as regret. The focus of this paper is on characterizing asymptotic bounds for regret for fixed but unknown reward distributions, which are commonly referred to as problem dependent bounds [?].

We consider linear bandits where the arms take values in an $N$-dimensional space and belong to a bounded polyhedron described by finitely many linear inequalities. We derive an asymptotic lower bound of $\Omega(N \log T)$ for this problem and present an algorithm that is (almost) asymptotically optimal. Our solution resolves a question posed by [1] that left open the possibility of efficient algorithms with asymptotic logarithmic regret bounds. Our algorithm alternates between exploration and exploitation phases, where a set of arms on the boundary of the polyhedron is played in the exploration phases and a greedily selected arm is played super-exponentially many times in the exploitation phase. Due to the simple nature of the strategy we are able to provide upper bounds that hold almost surely. We show that our regret concentrates around its expected value with probability one for all $T$. In contrast, regret for upper confidence bound based algorithms concentrates only at a polynomial rate [?]. Thus, our algorithms are more suitable for risk-averse decision making. A summary of our results and a comparison of regret bounds is given in Table 1. Numerical experiments show that the regret performance of our algorithm compares well against state-of-the-art linear bandit algorithms even for a reasonably small number of rounds, while being significantly better asymptotically.


                       K-armed bandits                 Linear bandits
                       dependent     independent       dependent                    independent
Lower bounds           $K \log T$    $\sqrt{KT}$       $N \log T$                   $N\sqrt{T}$
Upper bounds           $K \log T$    $\sqrt{KT}$       $N \log^{1+\epsilon} T$      $N\sqrt{T}$
Efficient algorithm    UCB1 [?]      MOSS [?]          SEE (this paper)             ConfidenceBall$_2$ [?]

Table 1: Summary of (problem) dependent and (problem) independent regret bounds in multi-armed bandits and linear bandits. We consider linear bandits over a bounded subset of an $N$-dimensional space, with $\epsilon > 0$. The problem dependent column for linear bandits presents the bounds obtained in this paper.


Related Work: Our regret bounds are related to those described in [1], who describe an algorithm (ConfidenceBall$_2$) with regret bounds that scale as $O((N^2/\Delta) \log^3 T)$, where $\Delta$ is the reward gap defined over extremal points. These algorithms belong to the class of so-called OFU algorithms (optimism in the face of uncertainty). Since OFU algorithms play only extremal points (arms), one may think that $\log T$ regret bounds can be attained for linear bandits by treating them as $K$-armed bandits, where $K$ denotes the number of extremal points of the set of actions. This possibility arises from the classical results on the $K$-armed bandit problem due to Lai and Robbins [?], who provided a complete characterization of expected regret by establishing a problem dependent lower bound of $\Omega(K \log T)$ and then providing an asymptotically optimal algorithm with a matching upper bound. But, as noted in [1][Sec 4.1, Example 4.5], the number of extremal points can be exponential in $N$, and this renders such an adaptation of multi-armed bandit algorithms inefficient. In the same paper, the authors pose it as an open problem to develop efficient algorithms with logarithmic regret for linear bandits over a polyhedral set of arms. They also remark that since the convex hull of a polyhedron is not strongly convex, the regret guarantees of their PEGE (Phased Exploration Greedy Exploitation) algorithm do not hold.

Our work is close to the FEL (Forced Exploration for Linear bandits) algorithm developed in [?]. FEL separates the exploration and exploitation phases by comparing the current round number against a predetermined sequence. FEL plays randomly selected arms in the exploration intervals and greedily selected arms in the exploitation intervals. However, our policy differs from FEL as follows: 1) we always play a fixed (deterministic) set of arms in the exploration phases; 2) noise is assumed to be bounded in [?], whereas we consider the more general sub-Gaussian noise model; 3) unlike FEL, our policy does not require computationally costly matrix inversions. FEL provides an expected regret guarantee of only $O(c \log^2 T)$, whereas our policy PolyLin has the optimal $O(N \log T)$ regret guarantee. Moreover, the authors in [?] remark that the leading constant $c$ in their regret bound can be set proportional to $\sqrt{N}$ (see the discussion following Th 2.4 in [?]), but this seems incorrect in light of the lower bound of $\Omega(N \log T)$ we establish in this paper.

In contrast to the asymptotic setting considered here, much of the machine learning literature deals with problem independent bounds. These bounds on regret apply in finite time and for the minimax case, namely, for the worst case over all reward (probability) distributions. The authors of [?] established a problem independent lower bound of $\Omega(\sqrt{KT})$ for multi-armed bandits, which was shown to be achievable in [?]. For linear bandits, problem independent bounds are well studied and are stated in terms of the dimension of the set of arms rather than its size. In [?], for the case of a finite number of arms, a lower bound of $\Omega(\sqrt{NT})$ with matching upper bounds is established, where $N$ denotes the dimension of the set of arms. For the case when the number of arms is infinite or the arms form a bounded subset of an $N$-dimensional space, a lower bound of $\Omega(N\sqrt{T})$ is established in [?] with matching achievable bounds.

Several variants and special cases of stochastic linear bandits are available depending on what forms the set of arms. The classical stochastic multi-armed bandit introduced by Robbins [?] and later studied by Lai and Robbins [?] is a special case of linear bandits where the set of actions available in each round is the standard orthonormal basis. Auer first studied stochastic linear bandits as an extension of "associated reinforcement learning" introduced in [?]. Since then several variants of the problem have been studied, motivated by various applications. In [?], the linear bandit setting is adopted to study content-based recommendation systems where the set of actions can change at each round (contextual), but their number is fixed. Another variant of linear bandits with a finite action set is spectral bandits [?], where the graph structure defines the set of actions and its size. Several authors [?] have considered linear bandits with arms constituting a (bounded) subset of a finite-dimensional vector space that remains fixed over the learning period. [?] considers cases where the set of arms can change between the rounds but must belong to a bounded subset of a fixed finite-dimensional vector space.

The paper is organized as follows: In Section 2 we describe the problem and set up notation. In Section 3 we derive a lower bound on expected regret and describe our main algorithm SEE and its variant SEE2. In Section 5 we analyze the performance of SEE, and its adaptation for a general polyhedron is discussed in Section 6. In Section 7 we provide probability 1 bounds on the regret of SEE. Finally, we numerically compare the performance of our algorithm against the state of the art in Section 8.

2 Problem formulation 


We consider a stochastic linear optimization problem with bandit feedback over a set of arms defined by a polyhedron. Let $C \subset \mathbb{R}^N$ denote a bounded polyhedral set of arms given by

$$C = \{x \in \mathbb{R}^N : Ax \le b\} \qquad (1)$$

where $A$ is a real matrix with $N$ columns and $b$ is a real vector with one entry per row of $A$. At each round $t$, selecting an arm $x_t \in C$ results in reward $r_t(x_t)$.

We investigate the case where the expected reward for each arm is a linear function regardless of the history. That is, for any history $\mathcal{H}_t$, there is a parameter $\theta \in [-1,1]^N$, fixed but unknown, such that

$$E[r_t(x) \mid \mathcal{H}_t] = \theta' x \quad \text{for all } t \text{ and } x \in C.$$

Under this setting the noise sequence $\{\nu_t\}$, where $\nu_t = r_t(x_t) - x_t'\theta$, forms a martingale difference sequence. Let $\mathcal{F}_t = \sigma\{\nu_1, \nu_2, \dots, \nu_t, x_1, \dots, x_{t+1}\}$ denote the $\sigma$-algebra generated by noise events and arm selections till time $t$. Then $\nu_t$ is $\mathcal{F}_t$-measurable and we assume that it satisfies

$$E[e^{b\nu_t} \mid \mathcal{F}_{t-1}] \le \exp\{b^2R^2/2\} \quad \text{for all } b \in \mathbb{R}, \qquad (2)$$

i.e., the noise is conditionally $R$-sub-Gaussian, which automatically implies $E[\nu_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathrm{Var}(\nu_t) \le R^2$. We can think of $R^2$ as the conditional variance of the noise. An example of $R$-sub-Gaussian noise is $\mathcal{N}(0, R^2)$, or any bounded distribution over an interval of length $2R$ with zero mean. In our work, $R$ is fixed but unknown.

A policy $\phi := (\phi_1, \phi_2, \dots)$ is a sequence of functions $\phi_t : \mathcal{H}_{t-1} \to C$ such that an arm is selected in round $t$ based on the history $\mathcal{H}_{t-1}$. Define the expected (pseudo) regret of policy $\phi$ over $T$ rounds as:

$$R_T(\phi) = T\theta'x^* - E\left[\sum_{t=1}^{T} \theta'\phi(t)\right] \qquad (3)$$

where $x^* = \arg\max_{x \in C} \theta'x$ denotes the optimal arm in $C$, which exists and is an extremal point$^1$ of the polyhedron $C$ [?]. The expectation is over the random realization of the arm selections induced by the noise process. The goal is to learn a policy that keeps the regret as small as possible. We will also be interested in the regret of the policy, defined as

$$\widetilde{R}_T(\phi) = T\theta'x^* - \sum_{t=1}^{T} \theta'\phi(t). \qquad (4)$$

$^1$An extremal point of a set is a point that is not a proper convex combination of points in the set.

For the above setting, we can use ConfidenceBall$_2$ [?] or UncertaintyEllipsoid [?] and achieve the optimal regret of order $N\sqrt{T}$. For linear bandits over a set with a finite number of extremal points, one can also achieve regret that scales more gracefully, growing logarithmically in time $T$, using algorithms for the standard multi-armed bandits. Indeed, from the fundamentals of linear programming,

$$\arg\max_{x \in C} \theta'x = \arg\max_{x \in \mathcal{E}(C)} \theta'x,$$

where $\mathcal{E} := \mathcal{E}(C)$ denotes the set of extremal points of $C$. Since the set of extremal points is finite for a polyhedron, we can use the standard Lai and Robbins algorithm [?] or UCB1 in [?], treating each extremal point as an arm, and obtain a (problem dependent) regret bound of order $(|\mathcal{E}|/\Delta) \log T$, where $\Delta := \theta'x^* - \max_{x \in \mathcal{E} \setminus x^*} \theta'x$ denotes the gap between the best and the next best extremal point. However, the leading term in these bounds can be exponential in $N$, rendering these algorithms ineffective. For example, the number of extremal points of $C$ can grow exponentially with $N$; a hypercube in $N$ dimensions already has $2^N$ of them.
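As a small illustration of the identity above and of the vertex count, the following sketch brute-forces the best vertex of a box and compares it with the closed-form coordinate-wise maximizer. The box $[-1,1]^N$, the sample parameter, and the function names are assumptions made for illustration only; they are not part of the algorithms in this paper.

```python
from itertools import product


def best_vertex(theta, lo=-1.0, hi=1.0):
    """Brute-force argmax of theta'x over the 2^N vertices of the box [lo, hi]^N."""
    vertices = product((lo, hi), repeat=len(theta))
    return max(vertices, key=lambda v: sum(t * x for t, x in zip(theta, v)))


def greedy_box(theta, lo=-1.0, hi=1.0):
    """Closed-form maximizer over the box: push each coordinate to the bound favoured by theta."""
    return tuple(hi if t > 0 else lo for t in theta)


# Hypothetical parameter with no zero entries, so the maximizer is unique.
theta = [0.4, -0.7, 0.1, -0.2]
num_vertices = 2 ** len(theta)  # number of arms a K-armed treatment would need
```

The brute-force search touches all $2^N$ vertices, which is exactly the cost that a multi-armed treatment of the extremal points incurs, while the coordinate-wise rule needs only $N$ comparisons on this particular set.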

Nevertheless, in analogy with the problem independent regret bounds in linear bandits, one wishes 
to derive problem dependent logarithmic regret where the dependence on set of arms is only linear 
in its dimension. Hence we seek an algorithm with regret of order N log T. 

In the following, we first derive a lower bound on the expected regret and develop an algorithm that 
is (almost) asymptotically optimal. 


3 Main results 


In this section we provide a lower bound on the expected regret and present our proposed policy and 
prove the main results regarding its complexity. 


3.1 Lower Bound 


We establish through a simple example that the regret of any asymptotically optimal linear bandit algorithm is lower bounded as $\Omega(N \log T)$. Recall the fundamental property of linear optimization that an optimal point is always an extremal point. Then any linear bandit algorithm on a polyhedral set of arms always plays the extremal points. We exploit this fact and, by mapping the problem to a standard multi-armed bandit, obtain the lower bound.

We need the following notation to state the result. Let $\{\eta(\beta)\}_{\beta \in [0,1]}$ denote a set of distributions parametrized by $\beta \in [0,1]$ and such that each $\eta(\beta)$ is absolutely continuous with respect to a positive measure $m$ on $\mathbb{R}$. Let $p(x;\beta)$ denote the probability density function associated with distribution $\eta(\beta)$, and let $KL(\beta_1,\beta_2)$ denote the Kullback-Leibler (KL) divergence between distributions $\eta(\beta_1)$ and $\eta(\beta_2)$, defined as $KL(\beta_1,\beta_2) = \int p(x;\beta_1) \log \frac{p(x;\beta_1)}{p(x;\beta_2)} \, dm(x)$. Consider a set of $K$ arms. We say that arm $k$ is parametrized by $\beta_k$ if its reward is distributed according to $\eta(\beta_k)$.

We are now ready to state the asymptotic lower bound for the linear bandit problem over any bounded polyhedron with positive measure. Without loss of generality, we restrict our attention to uniformly good policies as defined in [?]. We say that a policy $\phi$ is uniformly good if for all $\theta \in \Theta$, $R_T(\phi) = o(T^\alpha)$ for all $\alpha > 0$.


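Theorem 1 Let $\phi$ be any uniformly good policy on a bounded polyhedron with positive measure. For any $\theta \in [0,1]^N$, let $E[\eta(\theta_k)] = \theta_k$ for all $k = 1, 2, \dots, N$. Then,

$$\liminf_{T \to \infty} \frac{R_T(\phi)}{\log T} \ge \frac{[\dots]}{\max_k KL(\theta^*, \theta_k)}, \quad \text{where } \theta^* = \max_n \theta_n. \qquad (5)$$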


Proof sketch: First, note that the number of extremal points of any bounded polyhedron with positive measure is at least $N + 1$. We can then restrict to a bounded polyhedron with $N + 1$ extremal points. Let $C = \{x \in \mathbb{R}^N : x_i \ge 0 \ \forall i = 1, 2, \dots, N, \ \sum_{i=1}^{N} x_i \le 1\}$. The $N + 1$ extremal points of $C$ are $\{e_n : n = 1, 2, \dots, N\} \cup \{0\}$. In the linear bandit problem with unknown parameter $\theta$, playing the extremal point $e_n$ gives mean reward $\theta_n$. Also, by the property of linear optimization, any OFU policy will only play extremal points in every round. Then, the linear bandit over the polyhedron $C$ is the same as an $(N+1)$-armed bandit where the reward of the $k$th arm, $k = 1, 2, \dots, N$, is distributed as $\eta(\theta_k)$ with mean $\theta_k$, and the reward of the $(N+1)$th arm is distributed as $\eta(0)$ with mean 0.

The result follows from the Lai-Robbins lower bound for stochastic multi-armed bandits proved in [?] after verifying that the mean values of the parametrized distributions satisfy the required conditions.
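To make the reduction concrete, the sketch below evaluates the classical Lai-Robbins coefficient of $\log T$, namely the sum over sub-optimal arms of the gap divided by the KL divergence to the best arm, for the $(N+1)$-armed instance described above. Unit-variance Gaussian rewards (for which the KL divergence has a closed form) and the sample $\theta$ are assumptions chosen for illustration; the function names are hypothetical.

```python
def gaussian_kl(mean_p, mean_q, sigma=1.0):
    """KL divergence between N(mean_p, sigma^2) and N(mean_q, sigma^2)."""
    return (mean_p - mean_q) ** 2 / (2.0 * sigma ** 2)


def lai_robbins_constant(means, sigma=1.0):
    """Sum over sub-optimal arms of gap / KL(arm, best): the coefficient of log T."""
    best = max(means)
    return sum((best - m) / gaussian_kl(m, best, sigma) for m in means if m < best)


# Hypothetical parameter; arm n has mean theta_n and the extra arm (the origin) has mean 0.
theta = [0.9, 0.5, 0.4, 0.1]
arm_means = theta + [0.0]
constant = lai_robbins_constant(arm_means)
num_suboptimal = sum(1 for m in arm_means if m < max(arm_means))
```

With Gaussian rewards each term reduces to $2\sigma^2$ divided by the gap, so the coefficient grows with the number of sub-optimal arms, which here equals $N$.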


3.2 Algorithms 

The basic idea underlying our proposed technique is based on the following observations for linear optimization over a polyhedron. 1) The set of extremal points of a polyhedron is finite and hence $\Delta > 0$. 2) When $\hat\theta$ is sufficiently close to $\theta$, then over the set $C$ both $\arg\max \theta'x$ and $\arg\max \hat\theta'x$ give the same arm. We exploit these observations and propose a two-stage technique, where we first estimate $\theta$ based on a block of samples and then exploit it for a much longer block. This is repeated with increasing block lengths so that at each point the regret is logarithmic. For ease of exposition, we first consider a polyhedron that contains the origin and postpone the general case to Section 6.

Assume that the polyhedron $C = \{x \in \mathbb{R}^N : Ax \le b\}$ contains the origin as an interior point. Let $e_n$ denote the $n$th standard unit vector of dimension $N$. For all $1 \le n \le N$, let $z_n = \max\{z \ge 0 : z e_n \in C\}$. The subset of arms $B := \{z_n e_n : n = 1, 2, \dots, N\}$ are the vertices of the largest simplex bounded in $C$. Since $\theta_n = \theta'e_n$, we can estimate $\theta_n$ by repeatedly playing the arm $z_n e_n$. One can also estimate $\theta_n$ by playing an interior point $z e_n \in C$ for some $z > 0$. But as we will see later, selecting the maximum possible $z$ reduces the probability of estimation error.
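The exploration set can be computed directly from the inequality description. The sketch below is one way to do it, assuming the polyhedron is given as plain Python lists `A` (rows) and `b`, that the origin is strictly interior (every entry of `b` is positive), and that the set is bounded along each positive axis; the function name is hypothetical.

```python
def exploration_arms(A, b):
    """Return [z_1, ..., z_N] with z_n = max{z >= 0 : z * e_n in C}, C = {x : Ax <= b}.

    Along e_n the i-th constraint reads z * A[i][n] <= b[i], which only binds
    when A[i][n] > 0, so z_n is the smallest ratio b[i] / A[i][n] over such rows.
    """
    if any(bi <= 0 for bi in b):
        raise ValueError("origin must be an interior point (all b_i > 0)")
    n_dims = len(A[0])
    z = []
    for n in range(n_dims):
        limits = [bi / row[n] for row, bi in zip(A, b) if row[n] > 0]
        if not limits:
            raise ValueError("polyhedron is unbounded along e_%d" % (n + 1))
        z.append(min(limits))
    return z


# Hypothetical 2-D example: -1 <= x1 <= 2, -1 <= x2 <= 3, x1 + x2 <= 4.
A = [[1, 0], [-1, 0], [0, 1], [0, -1], [1, 1]]
b = [2, 1, 3, 1, 4]
z = exploration_arms(A, b)
```

The arms in $B$ are then `z[n]` times the $n$th unit vector; no linear program is needed for this step.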

Algorithm-SEE

In our policy, which we refer to as Sequential-Estimation-Exploitation (SEE), we split the time horizon into cycles, and each cycle consists of an exploration interval followed by an exploitation interval. We index the cycles by $c$ and denote the exploration and exploitation intervals in cycle $c$ as $E_c$ and $R_c$, respectively. In the exploration interval $E_c$, we play each arm in $B$ repeatedly $(2c+1)$ times. At the end of $E_c$, using the rewards observed for each arm in $B$ in the past $c$ cycles, we compute the ordinary least squares (OLS) estimate of each component $\theta_n$, $n = 1, 2, \dots, N$, separately and obtain the estimate $\hat\theta(c)$. Using $\hat\theta(c)$ as a proxy for $\theta$, we compute a greedy arm $x(c)$ by solving a linear program and play it repeatedly $2^{c^{2/(1+\epsilon)}}$ times in the exploitation interval $R_c$, where $\epsilon > 0$ is an input parameter. We repeat the process for each cycle. A formal description of SEE is given in Algorithm 1. The estimate in line 12 is computed for all $n = 1, 2, \dots, N$ as follows:

$$\hat\theta_n(c) = \frac{1}{(c+1)^2 z_n} \sum_{i=0}^{c} \sum_{j=1}^{2i+1} r_{n,i,j}, \qquad (6)$$

where $r_{n,i,j}$ denotes the reward observed on the $j$th play of arm $z_n e_n$ in cycle $i$.

Note that in the exploration intervals SEE plays a fixed set of arms and no adaptation happens, adding positive regret in each cycle. The regret incurred in the exploitation intervals starts reducing as the estimation error gets small, and when it falls below $\Delta/2$ the greedy step (line 15) selects the optimal arm and no regret is incurred in the exploitation intervals (Lemma 2). As we will show later, the probability of estimation error decays super-exponentially across the cycles, and hence the probability of playing a sub-optimal arm in the exploitation interval also decays super-exponentially.


Algorithm 1 SEE
 1: Input:
 2:    $C$: The polyhedron
 3:    $\epsilon$: Algorithm parameter
 4: Initialization:
 5:    Compute the set $B$
 6: for $c = 0, 1, 2, \dots$ do
 7:    Exploration:
 8:    for $n = 1 \to N$ do
 9:       for $j = 1 \to 2c + 1$ do
10:          Play arm $z_n e_n \in B$, observe reward
11:       end for
12:       Compute $\hat\theta_n(c)$
13:    end for
14:    $\hat\theta(c) \leftarrow (\hat\theta_1(c), \hat\theta_2(c), \dots, \hat\theta_N(c))$
15:    $x(c) \leftarrow \arg\max_{x \in C} x'\hat\theta(c)$
16:    Exploitation:
17:    for $j = 1 \to 2^{c^{2/(1+\epsilon)}}$ do
18:       Play arm $x(c)$, observe reward
19:    end for
20: end for
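The listing above can be simulated in a few lines when the polyhedron is simple. The sketch below assumes the box $[-1,1]^N$ (so that $z_n = 1$ and the greedy linear program is solved coordinate-wise), Gaussian noise, and an exploitation length of $2^{c^{2/(1+\epsilon)}}$ plays; the function name `see_box` and all default values are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import random


def see_box(theta, epsilon=0.3, cycles=6, noise_std=1.0, seed=0):
    """Simulate SEE on the box [-1, 1]^N; return (theta_hat, greedy_arm, rounds, pseudo_regret)."""
    rng = random.Random(seed)
    n_dims = len(theta)
    z = [1.0] * n_dims                       # z_n = 1 on the box, so B = {e_1, ..., e_N}
    best_value = sum(abs(t) for t in theta)  # theta'x* on the box
    reward_sums = [0.0] * n_dims
    plays_per_arm = 0
    rounds, regret = 0, 0.0
    theta_hat, greedy = [0.0] * n_dims, [1.0] * n_dims
    for c in range(cycles):
        # Exploration: play each arm z_n e_n exactly 2c + 1 times.
        for n in range(n_dims):
            for _ in range(2 * c + 1):
                reward_sums[n] += z[n] * theta[n] + rng.gauss(0.0, noise_std)
                regret += best_value - z[n] * theta[n]
                rounds += 1
        plays_per_arm += 2 * c + 1
        # Per-component estimate: average reward of arm z_n e_n divided by z_n.
        theta_hat = [reward_sums[n] / (plays_per_arm * z[n]) for n in range(n_dims)]
        # Greedy arm: on a box the linear program decouples across coordinates.
        greedy = [1.0 if t >= 0 else -1.0 for t in theta_hat]
        # Exploitation: play the greedy arm for a super-exponential number of rounds.
        length = int(2 ** (c ** (2.0 / (1.0 + epsilon))))
        gap = best_value - sum(t * x for t, x in zip(theta, greedy))
        regret += length * gap
        rounds += length
    return theta_hat, greedy, rounds, regret
```

Setting `noise_std=0.0` is a convenient sanity check: the estimate is then exact after the first cycle, so only the exploration rounds contribute to the pseudo-regret.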
Theorem 2 Let the noise be $R$-sub-Gaussian and without loss of generality$^2$ assume $\theta \in [-1,1]^N$. Then, the expected regret of SEE with parameter $\epsilon > 0$ is bounded as follows:

$$R_T(SEE) \le 2R_mN\log^{1+\epsilon} T + 4NR_m\gamma_1 \qquad (7)$$

where $R_m$ denotes the maximum reward and $\gamma_1$ is a constant that depends on the noise parameter $R$ and the sub-optimality gap $\Delta$.


The parameter $\epsilon$ determines the length of the exploitation intervals: a larger $\epsilon$ shortens them, so SEE spends relatively less time in exploitation and more time in exploration. More exploration improves the estimates and reduces the probability of playing a sub-optimal arm in the exploitation intervals. Hence $\epsilon$ determines how fast the regret concentrates, and the larger its value the more 'risk-averse' the algorithm is. This motivates us to consider a variant of SEE that is more risk averse, at the cost of increased expected regret.


3.3 Risk Averse Variant 


Our second algorithm, which we refer to as SEE2, is essentially the same as SEE, except that the length of the exploitation intervals is exponential instead of super-exponential and does not depend on $\epsilon$. Specifically, we play the greedy arm $2^c$ times in cycle $c$. Compared to SEE, SEE2 spends significantly more time in the exploration intervals, and hence the probability that it makes an error in the exploitation intervals is also significantly smaller, and thus its regret concentrates around the expected regret faster.
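The difference between the two schedules is easy to quantify. The sketch below compares the share of rounds spent exploring under SEE and SEE2, assuming exploitation lengths of $2^{c^{2/(1+\epsilon)}}$ and $2^c$ respectively and $N(2c+1)$ exploration plays per cycle; the function names and the convention that `epsilon=None` selects SEE2 are hypothetical.

```python
def exploitation_length(cycle, epsilon=None):
    """Exploitation plays in a cycle: 2^(c^(2/(1+eps))) for SEE, 2^c for SEE2 (epsilon=None)."""
    if epsilon is None:
        return 2 ** cycle
    return int(2 ** (cycle ** (2.0 / (1.0 + epsilon))))


def exploration_fraction(cycles, n_dims, epsilon=None):
    """Fraction of all rounds in the first `cycles` cycles that are exploration rounds."""
    explore = n_dims * sum(2 * c + 1 for c in range(cycles))
    exploit = sum(exploitation_length(c, epsilon) for c in range(cycles))
    return explore / (explore + exploit)


see2_share = exploration_fraction(10, 10)               # SEE2 schedule
see_share = exploration_fraction(10, 10, epsilon=0.3)   # SEE schedule
```

For the same number of cycles SEE2 devotes a much larger share of its rounds to exploration, which is the trade-off described above: better concentration at the price of a larger expected regret.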


Theorem 3 Let the noise be $R$-sub-Gaussian and $\theta \in [-1,1]^N$. Then, the expected regret of SEE2 is bounded as follows:

$$R_T(SEE2) \le 2R_mN\log^{2} T + 4NR_m\gamma_2 \qquad (8)$$

where $\gamma_2$ is a constant that depends on the noise parameter $R$ and the sub-optimality gap $\Delta$.


4 Optimal Algorithm. 


We next obtain an optimal algorithm that achieves the lower bound in (5) within a constant factor when the sub-Gaussian parameter $R$ is known.


$^2$For general $\theta$, we replace it by [...] and the same method works. Only $R_m$ is scaled by a constant factor.
Algorithm-PolyLin:

In our next policy, which we refer to as Polyhedral-Linear-bandits (PolyLin), we again split the time horizon into cycles consisting of an exploration interval followed by an exploitation interval as in SEE. As earlier, we index the cycles by $i$ and denote the exploration and exploitation intervals in cycle $i$ as $E_i$ and $R_i$, respectively. In the exploration interval $E_i$, we play each arm in $B$ once. After $c$ cycles, using the rewards observed for each arm in $B$ in the past exploration intervals $\{E_i, i = 1, 2, \dots, c\}$, we compute the ordinary least squares (OLS) estimate of each component $\theta_n$, $n = 1, 2, \dots, N$, separately, and obtain the estimate $\hat\theta(c)$ as follows:

$$\hat\theta_n(c) = \frac{1}{c\, z_n} \sum_{i=1}^{c} r_{n,i}, \qquad (9)$$

where $r_{n,i}$ denotes the reward observed from arm $z_n e_n$ in cycle $i$. Using $\hat\theta(c)$ as a proxy for $\theta$ we compute a greedy arm $x(c)$ and the estimated sub-optimality gap $\Delta(c)$ as follows:

$$\Delta(c) = x'(c)\hat\theta(c) - \max_{x \in \mathcal{E} \setminus x(c)} x'\hat\theta(c).$$

In the exploitation interval $R_c$, we play $x(c)$ repeatedly $2^{c\kappa(c)}$ times, where $\kappa(c)$ is set to $a\Delta(c)^2/2$ and $a = \min_n z_n^2/R^2$. We repeat the process for each cycle. A formal description of PolyLin is given in Algorithm 2.


Algorithm 2 PolyLin
 1: Input:
 2:    $C$: The polyhedron
 3:    $R$: Noise parameter
 4: Initialization:
 5:    Compute the set $B$
 6:    $a := \min_n z_n^2/R^2$
 7: for $i = 1, 2, \dots$ do
 8:    Exploration:
 9:    for $n = 1 \to N$ do
10:       Play arm $z_n e_n \in B$, observe reward
11:       $c = i$. Compute $\hat\theta_n(c)$ as in (9)
12:    end for
13:    $\hat\theta(c) \leftarrow (\hat\theta_1(c), \hat\theta_2(c), \dots, \hat\theta_N(c))$
14:    $x(c) \leftarrow \arg\max_{x \in C} x'\hat\theta(c)$
15:    $\kappa(c) \leftarrow a\Delta(c)^2/2$
16:    Exploitation:
17:    for $j = 1 \to 2^{c\kappa(c)}$ do
18:       Play arm $x(c)$, observe reward
19:    end for
20: end for
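The adaptive step of PolyLin can be sketched as follows. The code assumes the box $[-1,1]^N$ as the set of arms, computes the estimated gap by enumerating vertices, and assumes an exploitation length of $2^{c\kappa(c)}$ with $\kappa(c) = a\Delta(c)^2/2$; this exponent is our reading of the rule and should be treated as an assumption, as should the function names.

```python
from itertools import product


def estimated_gap(theta_hat, lo=-1.0, hi=1.0):
    """Gap between the best and second-best vertex of the box under theta_hat."""
    values = sorted(
        (sum(t * x for t, x in zip(theta_hat, v))
         for v in product((lo, hi), repeat=len(theta_hat))),
        reverse=True,
    )
    return values[0] - values[1]


def polylin_exploitation_length(cycle, theta_hat, z, noise_r):
    """Return (kappa, number of exploitation plays) for the given cycle."""
    a = min(zn ** 2 for zn in z) / noise_r ** 2
    kappa = a * estimated_gap(theta_hat) ** 2 / 2.0
    return kappa, int(2 ** (cycle * kappa))


# Hypothetical estimate after 16 cycles, with z_n = 1 and R = 1.
kappa, plays = polylin_exploitation_length(16, [0.5, -0.25, 1.0], [1.0, 1.0, 1.0], 1.0)
```

Enumerating all vertices is only viable for very small $N$ and is used here purely to keep the example self-contained; in general the gap would have to be obtained without listing every extremal point.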


Note that the exploration intervals of PolyLin are of fixed length, whereas in SEE they increase as time progresses. Also, the exploitation intervals in PolyLin are adaptive, whereas they are non-adaptive in SEE.

Theorem 4 Let the noise be $R$-sub-Gaussian and without loss of generality assume $\theta \in [-1,1]^N$. Then, the expected regret of PolyLin is bounded as follows:

$$R_T(PolyLin) \le 2R_mN\frac{\log T}{\kappa} + 4R_mN\gamma_3, \qquad (10)$$

where $R_m$ denotes the maximum reward, and $\gamma_3$ and $\kappa$ are constants that depend on the noise parameter $R$ and the sub-optimality gap $\Delta$.


5 Regret Analysis 


In this section we prove Theorem 2; the proof of Theorem 3 follows similarly and is omitted. We first derive the probability of error in estimating each component of $\theta$ in each cycle. Note that in the exploration stage of each cycle $c$ we sample each arm $z_n e_n \in B$, $n = 1, 2, \dots, N$, 2 times more than in the exploration stage of the previous cycle. Thus, we have at least $c^2$ plays of each arm $z_n e_n \in B$ at the end of cycle $c$. The estimation error of component $\theta_n$ after $c$ cycles is given as follows:


Lemma 1 Let the noise be $R$-sub-Gaussian and $\delta > 0$. In any cycle $c$ of both SEE and SEE2, for all $n = 1, 2, \dots, N$ we have

$$\Pr\left\{|\hat\theta_n(c) - \theta_n| > \delta\right\} \le 2\exp\{-c^2\delta^2z_n^2/2R^2\}. \qquad (11)$$


Note that the larger the value of $z_n$, the smaller the probability of estimation error. The next lemma gives the probability that we play a suboptimal arm in the exploitation intervals of a cycle.
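The tail bound of Lemma 1 can be checked numerically. The sketch below compares the bound $2\exp\{-n\delta^2z^2/(2R^2)\}$, with $n$ standing for the number of plays of the arm, against a Monte Carlo estimate of the error frequency; Gaussian noise with standard deviation $R$, the trial count, the seed, and the function names are all assumptions made for illustration.

```python
import math
import random


def tail_bound(num_plays, delta, z, noise_r):
    """Sub-Gaussian bound on Pr(|theta_hat_n - theta_n| > delta) after num_plays plays."""
    return 2.0 * math.exp(-num_plays * delta ** 2 * z ** 2 / (2.0 * noise_r ** 2))


def empirical_error_rate(theta_n, num_plays, delta, z, noise_r, trials=2000, seed=1):
    """Monte Carlo frequency of |theta_hat_n - theta_n| > delta under Gaussian noise."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # Rewards of arm z * e_n have mean z * theta_n.
        total = sum(z * theta_n + rng.gauss(0.0, noise_r) for _ in range(num_plays))
        estimate = total / (num_plays * z)
        if abs(estimate - theta_n) > delta:
            errors += 1
    return errors / trials


bound = tail_bound(25, 0.3, 1.0, 1.0)
rate = empirical_error_rate(0.3, 25, 0.3, 1.0, 1.0)
```

Doubling `z` in both calls shrinks the bound and the empirical rate, which is why the exploration arms are pushed out to the boundary of the polyhedron.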
Lemma 2 For every cycle $c$, we have

a. Let $a := \min_n z_n^2/R^2$. The estimation error is bounded as

$$\Pr\{\|\hat\theta(c) - \theta\|_\infty > \eta\} \le 2N\exp\{-c^2\eta^2a\}. \qquad (12)$$

b. Let $h = \sup_{x \in C}\|x\|_1$. Then the error in reward estimation is bounded as

$$\Pr\left\{\exists\, x \in C \text{ such that } |\hat\theta(c)'x - \theta'x| > \eta\right\} \le 2Ne^{-ac^2\eta^2/h^2}. \qquad (13)$$

c. The probability that we play a sub-optimal arm is bounded as

$$\Pr\left\{\arg\max_{x \in C}\hat\theta(c)'x \ne \arg\max_{x \in C}\theta'x\right\} \le 2Ne^{-ac^2\Delta^2/4}. \qquad (14)$$

The proofs of Lemmas 1 and 2 are given in the appendix. Recall that the number of extremal points is finite for the polyhedron $C$ and $\Delta > 0$. We use this fact to argue that whenever $\|\hat\theta(c) - \theta\|_\infty < \Delta/2$, the greedy stage of the algorithm selects the optimal arm. This is an important observation and follows from the continuity property of the optimal point in linear optimization theory [?]. Further, the probability that the estimation error exceeds this threshold decays super-exponentially fast in our policy, implying that the probability that we incur a positive regret in the exploitation intervals gets negligibly small over the cycles. We compute the expected regret incurred in the exploration and exploitation intervals separately.


5.1 Regret of SEE. 


We analyze the regret in the Exploration and Exploitation phases separately as follows. 
Exploration regret: At the end of cycle $c$, each arm in $B$ is played $\sum_{i=0}^{c-1}(2i+1) = c^2$ times. The total expected regret from the exploration intervals after $c$ cycles is at most $Nc^2R_m$.

Exploitation regret: The total expected regret from the exploitation intervals after $c$ cycles is

$$4NR_m\sum_{i=1}^{c} 2^{i^{2/(1+\epsilon)}} e^{-ai^2\Delta^2/4} \le 4NR_m\gamma_1, \qquad (15)$$

where $\gamma_1 := \sum_{i=1}^{\infty} 2^{i^{2/(1+\epsilon)}} e^{-ai^2\Delta^2/4}$ is a convergent series. After $c$ cycles, the total number of plays is $T = \sum_{i=1}^{c} 2^{i^{2/(1+\epsilon)}} + Nc^2 \ge 2^{c^{2/(1+\epsilon)}}$ and we get $c^2 \le \log^{1+\epsilon} T$. Finally, the expected regret from $T$ rounds is bounded as

$$R_T(SEE) \le 2R_mN\log^{1+\epsilon}T + 4NR_m\gamma_1 = O(N\log^{1+\epsilon}T).$$
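The step from the play count to the bound on $c^2$ can be spelled out as follows. This is a sketch that assumes logarithms are taken to base 2 and that lower bounds $T$ by the length $2^{c^{2/(1+\epsilon)}}$ of the last exploitation interval alone.

```latex
\begin{align*}
T \;\ge\; 2^{c^{2/(1+\epsilon)}}
  &\;\Longrightarrow\; c^{2/(1+\epsilon)} \le \log_2 T \\
  &\;\Longrightarrow\; c^{2} \le (\log_2 T)^{1+\epsilon}, \\
\text{so the exploration regret satisfies}\quad
N c^{2} R_m &\;\le\; N R_m \log^{1+\epsilon} T \;\le\; 2 R_m N \log^{1+\epsilon} T .
\end{align*}
```

Adding the bounded exploitation term then gives the overall $O(N\log^{1+\epsilon}T)$ rate.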


5.2 Regret of PolyLin. 


We analyze the regret in the Exploration and Exploitation phases separately as follows. 
Exploration regret: After $c$ cycles, each arm in $B$ is played $c$ times. The total expected regret from the exploration intervals after $c$ cycles is at most $NcR_m$.

Exploitation regret: The total expected regret from the exploitation intervals after $c$ cycles is

$$4NR_m\sum_{i=1}^{c} 2^{i\kappa(i)}e^{-ia\Delta^2} \le 4NR_m\sum_{i=1}^{c} e^{i\kappa(i) - ia\Delta^2} \le 4NR_m\sum_{i=1}^{\infty} e^{ia(\Delta(i)^2/2 - \Delta^2)}. \qquad (16)$$

Now consider the series $\gamma_3 := \sum_{i=1}^{\infty} e^{ia(\Delta(i)^2/2 - \Delta^2)}$.

• From Lemma 2(a), $\hat\theta(c) \to \theta$ as $c \to \infty$ almost surely, so we get $x(c) \to x^*$ almost surely, which in turn implies $\Delta(c) \to \Delta$ almost surely.

• Then, for $0 < \epsilon < \Delta^2/4$, the difference $\Delta(c)^2/2 - \Delta^2 < -\Delta^2/2 + \epsilon < 0$ for all but finitely many $c$. Hence, $\gamma_3$ is finite.

After $c$ cycles the total number of plays is $T = \sum_{i=1}^{c} 2^{i\kappa(i)} + Nc \ge 2^{c\kappa(c)}$ and we get $c \le \log T/\kappa(c)$. Finally, the expected regret from $T$ rounds, as $T \to \infty$, is bounded as

$$R_T(PolyLin) \le 2R_mN\frac{\log T}{\kappa(c)} + 4NR_m\gamma_3.$$

Note that $\Delta(c)^2/2 - \Delta^2 > -\Delta^2/2 - \epsilon$ for all but finitely many $c$. Then for sufficiently large $c$ we get $\kappa(c)/a > \Delta^2/2 - \epsilon > \frac{1}{4}\Delta^2$. Substituting in the last inequality we get

$$R_T(PolyLin) \le 8R_mN\frac{\log T}{a\Delta^2} + 4NR_m\gamma_3 = O(N\log T).$$


6 General Polyhedron 


In this section we extend the analysis of the previous section to the case where origin is not an 
interior point of C. 


Analogous to the set $B$, we first define a set of arms that lie on the boundary of the polyhedron; these points are computed with respect to an interior point $x$ of $C$ that we use as a proxy for the origin. We use OPT-1 to find an interior point whose smallest distance to the boundary along the directions $\{e_1, e_2, \dots, e_N\}$ is the largest. The motivation to maximize the minimal distance to the boundary comes from Lemma 2, where a larger value of $a$ implies a smaller probability of estimation error.

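OPT-1:

    $(x, y) = \arg\max_{x, y} \min_i y_i$
    subject to:
        $Ax \le b$
        $y_i \ge 0 \quad \forall i = 1, 2, \dots, N$
        $A(x + y_i e_i) \le b \quad \forall i = 1, 2, \dots, N$
        $A(x - y_i e_i) \le b \quad \forall i = 1, 2, \dots, N$

OPT-2:

    $(x, y) = \arg\max_{x, y, \alpha} \alpha$
    subject to:
        $\alpha \ge 0; \quad Ax \le b$
        $y_i - \alpha \ge 0 \quad \forall i = 1, 2, \dots, N$
        $A(x + y_i e_i) \le b \quad \forall i = 1, 2, \dots, N$
        $A(x - y_i e_i) \le b \quad \forall i = 1, 2, \dots, N$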

OPT-1 can be translated into the equivalent linear program given in OPT-2, and hence the point $x$ can be efficiently computed. We note that the points $\{x + y_n e_n : n = 1, 2, \dots, N\}$ need not all lie on the boundary. To see this, suppose the point $x$ returned by OPT-1 is closest to the boundary along the $i$th direction. Then the vector with all its components equal to $y_i$ is a solution of OPT-1. To overcome this, we further stretch each point $x + y_n e_n$ along the direction $e_n$ until it hits the boundary. Let $z_n = \arg\max\{z \ge 0 : x + z e_n \in C\}$. Finally, we fix the set of arms we use for exploration as $B = \{z_n e_n + x : n = 1, 2, \dots, N\}$.

We are now ready to present an algorithm for linear bandits over any polyhedron. For the general polyhedron, we use SEE with the exploration strategy modified as follows. In cycle $c$, we first play the arm $x$ for $2c+1$ times and then play each arm in $B$ for $2c+1$ times as earlier. To estimate the component $\theta_n$, we average the difference in rewards observed from arms $x + z_n e_n$ and $x$ so far. From a straightforward modification of the regret analysis of SEE, we can show that the expected regret of the modified algorithm is upper bounded as $O(N \log^{1+\epsilon} T)$ for all $\epsilon > 0$.
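The difference-based estimate can be sketched as below. The code plays the interior point and each shifted arm the same number of times, averages the rewards, and divides the difference of the averages by $z_n$; the division by $z_n$, the Gaussian noise, the name `x_bar` for the interior point, and the function name are assumptions made for this illustration.

```python
import random


def estimate_theta_general(theta, x_bar, z, plays, noise_std=1.0, seed=0):
    """Estimate theta from rewards of the arms x_bar + z_n e_n and of x_bar itself."""
    rng = random.Random(seed)

    def reward(arm):
        # Linear reward theta'arm plus zero-mean Gaussian noise.
        return sum(t * x for t, x in zip(theta, arm)) + rng.gauss(0.0, noise_std)

    base = sum(reward(x_bar) for _ in range(plays)) / plays
    estimates = []
    for n in range(len(theta)):
        arm = list(x_bar)
        arm[n] += z[n]  # shifted arm x_bar + z_n e_n
        shifted = sum(reward(arm) for _ in range(plays)) / plays
        # E[shifted - base] = z_n * theta_n, so divide by z_n.
        estimates.append((shifted - base) / z[n])
    return estimates
```

Because the reward of the interior point is subtracted, its unknown contribution $\theta'x$ cancels in expectation, at the cost of the extra exploration plays noted in the text.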


The new algorithm requires that we play the arm $x$ along with the arms in $B$ in the exploration intervals to obtain an estimate of $\theta$, which increases the length of the exploration intervals. However, it is possible to obtain estimates only by playing arms in $B$, provided we suitably modify the estimation method. More details are given in the appendix.


7 Probability 1 Regret Bounds 

Recall the definiton of expected regret and regret in 0 and Q. In this section we show that with 
probability 1 , the regret of our algorithms are within a constant factor from the their expected regret. 

Theorem 5 With probability 1, RriSEE) is ©(A log^+'^ T) and Rt{SEE2) is 0[N log^ T). 
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Figure 1: Regret comparison against multi-armed Figure 2; Regret comparison against linear 

bandits, arms are comers of 10-dim. hypercube. bandit algorithms on 10-dim. hypercube. 

Proof: Let C„ denote an event that we select sub-optimal arm in the nth cycle. From Lemma 
H this event is bounded as Pr{Cn} < iV exp{—0(n^)}. Hence < oo. Now, 

Rom application of Borel-Cantelli lemma, we get Pr{limsup„_^ 3 o C„} = 0, which implies that 
almost surely SEE and SEE2 play optimal arm in all but finitely many cycles. Hence the ex¬ 
ploitation intervals contribute only a bounded regret. Since the regret due to exploration inter¬ 
vals is deterministic, the regret of SEE and SEE2 are within a constant factor from their ex¬ 
pected regret with probability 1, i.e., Pr{3 Ci such that < Rt{SEE) + Ci} and 

Pr{3 C 2 such that Rt{SEE2) < Rt{SEE2) + 02 }- This completes the claim. 

We note that the regret bounds proved in earlier work hold with high confidence, whereas ours hold with
probability 1 and hence provide a stronger performance guarantee.

8 Experiments 

In this section we investigate the numerical performance of our algorithms against known
algorithms. We run the algorithms on a hypercube with dimension $N = 10$. We generated $\theta \in [0,1]^N$
randomly, and the noise in each round is a zero-mean Gaussian random variable with variance 1. The
experiments are averaged over 10 runs. In Fig. 1 we compare SEE ($\epsilon = 0.3$) and SEE2 against
UCB-Normal [20], where we treated each extremal point as an arm of a $2^N$-armed bandit problem.
As expected, our algorithms perform much better. UCB-Normal needs to sample each of the $2^N$
arms at least once before it can start learning the right arm, whereas our algorithms start playing the
right arm after a few cycles of exploration intervals. In Fig. 2, we compare our algorithms against the
linear bandit algorithm LinUCB and the self-normalization-based algorithm, which is labeled
SelfNormalized in the figure. For these we set the confidence parameter to 0.001. We see that SEE beats
LinUCB by a large margin, and its performance comes close to that of the SelfNormalized algorithm.
Note that the SelfNormalized algorithm requires knowledge of the sub-Gaussianity parameter $R$ of the
noise, whereas our algorithms are agnostic to this parameter. Although SEE2 seems to play the
right arm in exploitation intervals, its regret performance is poor. This is due to the increased number
of exploration intervals, where no adaptation happens and a positive regret is always incurred.

The numerical performance of SEE2 can be improved by adaptively choosing the arms played in the
exploration intervals as follows, at the cost of increased computational complexity. In each cycle $c + 1$, we
compute a new set $B$ by setting $x$ to $x(c)$, the greedy arm selected in the previous cycle, and
play the arms of the new set as in the exploration intervals of the algorithm given for the general
polyhedron. However, since $x(c)$ is an extremal point, some of the $z_n$'s are zero. To overcome this, we
slightly shift the point $x(c)$ into the interior of the polyhedron along the direction $x(c) - x$ and find
a new set $B$ with respect to the new interior point. The regret of the algorithm based on this adaptive
exploration strategy is shown in Fig. 2 with the label 'Improved-SEE2'. As shown, the modification
improves the performance of SEE2 significantly. In all the numerical plots, we initialized the algorithms
to run from cycle number 5.


9 Conclusion 

We studied stochastic linear optimization over a polyhedral set of arms with bandit feedback. We
provided an asymptotic lower bound for any policy and developed algorithms that are nearly asymptotically
optimal. The regret of the algorithms grows (near) logarithmically in $T$, and its growth rate is linear
in the dimension of the polyhedron. We showed that the regret upper bounds hold almost surely.
The regret growth rate of our algorithms is $\log^{1+\epsilon} T$ for some $\epsilon > 0$. It would be interesting to develop
strategies that work for $\epsilon = 0$ while still maintaining a linear growth rate in $N$.
Proof of Lemma 1


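Let $\epsilon_{n,i,j}$ denote the noise in the reward from playing $z_n e_n$ in phase $i$ for the $j$th time. We bound the
estimation error as follows, where the sums and products run over all $c^2$ plays of $z_n e_n$ made so far:

\begin{align}
\Pr\left\{ \left| \hat{\theta}_n(c) - \theta_n \right| > \delta \right\} \tag{17} \\
&= \Pr\left\{ \left| \frac{1}{c^2 z_n} \sum_{i,j} \epsilon_{n,i,j} \right| > \delta \right\} \tag{18} \\
&= \Pr\left\{ s \left| \sum_{i,j} \epsilon_{n,i,j} \right| > s c^2 z_n \delta \right\} \tag{19} \\
&= \Pr\left\{ \exp\left\{ s \left| \sum_{i,j} \epsilon_{n,i,j} \right| \right\} > \exp\{ s c^2 z_n \delta \} \right\} \tag{20} \\
&\le 2 \Pr\left\{ \exp\left\{ s \sum_{i,j} \epsilon_{n,i,j} \right\} > \exp\{ s c^2 z_n \delta \} \right\} \tag{21} \\
&\le 2\, \mathbb{E}\left[ \exp\left\{ s \sum_{i,j} \epsilon_{n,i,j} \right\} \right] \exp\{ -s c^2 z_n \delta \} \tag{22} \\
&\le 2 \prod_{i,j} \mathbb{E}\left[ \exp\{ s \epsilon_{n,i,j} \} \mid \mathcal{F}_{t-1} \right] \exp\{ -s c^2 z_n \delta \} \tag{23} \\
&\le 2 \prod_{i,j} \exp\{ s^2 \beta^2 / 2 \} \exp\{ -s c^2 z_n \delta \} \tag{24} \\
&= 2 \exp\left\{ c^2 \left( s^2 \beta^2 / 2 - s z_n \delta \right) \right\}, \tag{25}
\end{align}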

where (18) follows from the estimation step of the algorithm. In (19) and (20) we multiplied both sides
inside the probability by $s > 0$ and then exponentiated them. (21) follows by applying the union
bound and using the symmetry of the noise terms. In (22) we applied the Markov inequality.
In (23) we applied the conditional independence property of the noise. (24) follows by applying the
definition of the sub-Gaussian property.

Note that the upper bound in (25) holds for all $s > 0$ and is minimized at $s^* = z_n \delta / \beta^2 > 0$. Finally, the
lemma follows by substituting $s^*$ in (25).
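Minimizing the exponent in (25) over $s$ gives $s^* = z_n \delta / \beta^2$ and hence the tail bound $2 \exp\{-c^2 z_n^2 \delta^2 / (2\beta^2)\}$. The following Monte Carlo sketch compares this bound with the empirical tail probability. It assumes Gaussian noise with standard deviation `beta` (one instance of sub-Gaussian noise) and $c^2$ plays of the arm; the function names are hypothetical.

```python
import math
import random


def chernoff_bound(c, z, delta, beta):
    """Tail bound obtained by substituting s* = z * delta / beta**2 into (25)."""
    return 2.0 * math.exp(-(c ** 2) * (z ** 2) * (delta ** 2) / (2.0 * beta ** 2))


def empirical_tail(c, z, delta, beta, trials, rng):
    """Fraction of trials in which the estimation error exceeds delta."""
    plays = c * c  # assumed number of plays of the arm z * e_n
    exceed = 0
    for _ in range(trials):
        noise_sum = sum(rng.gauss(0.0, beta) for _ in range(plays))
        error = noise_sum / (plays * z)  # estimate minus the true component
        if abs(error) > delta:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    rng = random.Random(0)
    print("empirical:", empirical_tail(5, 1.0, 0.5, 1.0, 20000, rng))
    print("bound:    ", chernoff_bound(5, 1.0, 0.5, 1.0))
```

The bound is loose by a constant factor for Gaussian noise, but it decays at the same $\exp\{-\Theta(c^2)\}$ rate, which is what the regret analysis relies on.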


Proof of Lemma 2

Part a: 

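We bound the estimation error as follows:

\begin{align}
\Pr\left\{ \| \hat{\theta}(c) - \theta \|_{\infty} > \eta \right\} \tag{26} \\
&\le \Pr\left\{ \exists\, n : | \hat{\theta}_n(c) - \theta_n | > \eta \right\} \tag{27} \\
&\le \sum_{n=1}^{N} \Pr\left\{ | \hat{\theta}_n(c) - \theta_n | > \eta \right\} \tag{28} \\
&\le 2N \exp\{ -a c^2 \eta^2 \}. \tag{29}
\end{align}

In (28) we applied the union bound and in (29) we applied Lemma 1.

Part b: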


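For all $x \in C$, we have

$$| x'\hat{\theta}(c) - x'\theta | \le \| \hat{\theta}(c) - \theta \|_{\infty} \| x \|_{1}. \tag{30}$$

Define the events $A = \{ \exists\, x \text{ such that } | x'\hat{\theta}(c) - x'\theta | > \eta \}$ and $B = \{ \| \hat{\theta}(c) - \theta \|_{\infty} h > \eta \}$. The last
inequality implies $\Pr\{A\} \le \Pr\{B\}$. The claim follows from part a of the lemma.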

Part c: 

Suppose $y \ne x^*$, where $x^*$ is the optimal arm, is such that $\hat{\theta}'(c) y > \hat{\theta}'(c) x^*$. Then, since
$\theta' x^* - \theta' y > \Delta$, we must have that either $| \theta' x^* - \hat{\theta}'(c) x^* | > \Delta/2$ or $| \hat{\theta}'(c) y - \theta' y | > \Delta/2$;
otherwise we cannot close the gap. Hence, if the greedy selection in cycle $c$ is not $x^*$, then
there exists an $x \in C$ such that $| \hat{\theta}'(c) x - \theta' x | > \Delta/2$. From part b this probability is bounded as
$2N \exp\{ -a c^2 \eta^2 / h^2 \}$, where $\eta = \Delta/2$. This completes the proof.
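The argument in parts b and c can be checked numerically: if every coordinate of the estimate is within $\Delta / (2h)$ of the truth, the greedy arm must coincide with the optimal arm. The sketch below is an illustration under assumptions not fixed by the text: the arms are taken to be the corners of the hypercube $\{-1, 1\}^4$, `h` is computed as the largest $\ell_1$ norm over the arms, and the helper names are hypothetical.

```python
import itertools
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def greedy_arm(arms, theta_hat):
    """Arm maximizing the estimated reward."""
    return max(arms, key=lambda x: dot(x, theta_hat))


def reward_gap(arms, theta):
    """Gap Delta between the best and second-best expected rewards."""
    values = sorted((dot(x, theta) for x in arms), reverse=True)
    return values[0] - values[1]


def greedy_is_robust(arms, theta, trials, rng):
    """Check that sup-norm errors below Delta / (2h) never change the greedy arm."""
    h = max(sum(abs(a) for a in x) for x in arms)  # largest l1 norm over the arms
    radius = 0.99 * reward_gap(arms, theta) / (2.0 * h)
    x_star = greedy_arm(arms, theta)
    for _ in range(trials):
        theta_hat = [t + rng.uniform(-radius, radius) for t in theta]
        if greedy_arm(arms, theta_hat) != x_star:
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [rng.uniform(0.1, 1.0) for _ in range(4)]
    arms = list(itertools.product([-1.0, 1.0], repeat=4))
    print(greedy_is_robust(arms, theta, 200, rng))
```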


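Estimation in the case of a general polyhedron

Let $x_i = x + \alpha_i e_i$. Let $f_i(c)$ denote the average of the rewards obtained
from arm $x_i$ till the end of phase $c$. At the end of phase $c$, we estimate $\theta$ as follows:

$$\hat{\theta}(c) = \left( \mathbf{1} x' + D(\alpha) \right)^{-1} f(c),$$

where $D(\alpha)$ denotes the diagonal matrix with diagonal elements $\alpha_1, \dots, \alpha_N$ and $f(c)$ is the vector with $i$th
component $f_i(c)$. By applying the matrix inversion lemma we get

$$\hat{\theta}(c) = \left( D^{-1}(\alpha) - \frac{D^{-1}(\alpha) \mathbf{1} x' D^{-1}(\alpha)}{1 + x' D^{-1}(\alpha) \mathbf{1}} \right) f(c).$$

After simplification, for each $i = 1, 2, \dots, N$ we have

$$\hat{\theta}_i(c) = \frac{1}{\alpha_i} \left( f_i(c) - \frac{\sum_{j=1}^{N} x_j f_j(c) / \alpha_j}{1 + \sum_{j=1}^{N} x_j / \alpha_j} \right).$$

Substituting the reward from arm $x_i$, i.e.,

$$r_i = x'\theta + \alpha_i \theta_i + \epsilon,$$

and further simplifying we get

$$\hat{\theta}_i(c) = \frac{1}{\alpha_i} \left( \alpha_i \theta_i + \epsilon_i(c) - \sum_{j=1}^{N} \beta_j \epsilon_j(c) \right),$$

where $\beta_j = \frac{x_j / \alpha_j}{1 + \sum_{k=1}^{N} x_k / \alpha_k}$ and $\epsilon_j(c)$ is the noise average from playing arm $x_j$.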
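The component-wise estimator can be sanity-checked in the noise-free case, where it should return $\theta$ exactly. The sketch below assumes the arms are $x_i = x + \alpha_i e_i$, so that the mean reward of arm $x_i$ is $x'\theta + \alpha_i \theta_i$, and that $1 + \sum_j x_j / \alpha_j \ne 0$; the function names and the numerical values are hypothetical.

```python
def mean_rewards(x, alpha, theta):
    """Noise-free average rewards f_i = x'theta + alpha_i * theta_i of the arms x + alpha_i * e_i."""
    base = sum(xj * tj for xj, tj in zip(x, theta))
    return [base + ai * ti for ai, ti in zip(alpha, theta)]


def estimate_componentwise(x, alpha, f):
    """Solve (1 x' + D(alpha)) theta = f via the rank-one (matrix inversion lemma) formula."""
    denom = 1.0 + sum(xj / aj for xj, aj in zip(x, alpha))
    weighted = sum(xj * fj / aj for xj, aj, fj in zip(x, alpha, f))
    return [(fi - weighted / denom) / ai for fi, ai in zip(f, alpha)]


if __name__ == "__main__":
    x = [0.2, 0.3, 0.1]
    alpha = [0.5, 0.4, 0.3]
    theta = [1.0, -2.0, 0.5]
    f = mean_rewards(x, alpha, theta)
    print(estimate_componentwise(x, alpha, f))  # should be close to theta
```

Using the rank-one structure avoids a general matrix inversion: each component costs $O(N)$ operations once the two sums are computed.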